Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SEDONA-714] Add geopandas to spark arrow conversion. #1825

Merged
merged 9 commits into from
Feb 26, 2025

Conversation

Imbruced
Copy link
Member

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

  • Yes, the URL of the associated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-XXX. The PR name follows the format [SEDONA-XXX] my subject.

  • No:

    • this is a documentation update. The PR name follows the format [DOCS] my subject
    • this is a CI update. The PR name follows the format [CI] my subject

What changes were proposed in this PR?

How was this patch tested?

Did this PR include necessary documentation updates?

  • Yes, I am adding a new API. I am using the current SNAPSHOT version number in vX.Y.Z format.
  • Yes, I have updated the documentation.
  • No, this PR does not affect any public API so no need to change the documentation.

@jiayuasu
Copy link
Member

@paleolimbot

@Imbruced
Copy link
Member Author

I ll fix the missing function issue

@Imbruced
Copy link
Member Author

Starting from Spark 4.0, we can pass the whole arrow table to Spark.createDataFrame. I don't know when the release will be.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! I'm new to this code base, so consider my comments optional nits 🙂

Starting from Spark 4.0, we can pass the whole arrow table to Spark.createDataFrame

Based on this PR I'm happy to attempt backporting GeoArrow import of anything implementing __arrow_c_stream__, circumventing a materialize of the GeoPandas data frame as a follow-up 🙂

from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType, StructField, DataType, ArrayType, MapType
import pyarrow as pa
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what the dependency situation is like for spark, but it may be worth making this a lazy import (e.g., like in dataframe_to_arrow so that when we import from seconda.utils.geoarrow from sedona/spark/__init__.py we don't necessarily require pyarrow to be installed (alternatively, we could add pyarrow to the apache-sedona[spark] extras to match the runtime requirement).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I'll make all the changes later today. Thank you for the review!

return [gen_new_name[name]() for name in names]


def _deduplicate_field_names(dt: DataType) -> DataType:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _deduplicate_field_names(dt: DataType) -> DataType:
# Backport from Spark 4.0
# https://github.com/apache/spark/blob/3515b207c41d78194d11933cd04bddc21f8418dd/python/pyspark/sql/pandas/types.py#L1385
def _deduplicate_field_names(dt: DataType) -> DataType:

@Imbruced
Copy link
Member Author

This is awesome! I'm new to this code base, so consider my comments optional nits 🙂

Starting from Spark 4.0, we can pass the whole arrow table to Spark.createDataFrame

Based on this PR, I'm happy to attempt backporting GeoArrow import of anything implementing __arrow_c_stream__, circumventing a materialize of the GeoPandas data frame as a follow-up 🙂

@paleolimbot
That would be great. When writing Chapter 6, I included examples with the code you provided, which significantly improves the transformation time when we create geopandas from Sedona. Adding something similar to Sedona from Geopandas would be great, so I added this MR. I am happy to apply all the changes you mentioned.

@jiayuasu jiayuasu changed the title SEDONA-714 Add geopandas to spark arrow conversion. [SEDONA-714] Add geopandas to spark arrow conversion. Feb 24, 2025
@Imbruced Imbruced marked this pull request as ready for review February 25, 2025 10:18
@Imbruced Imbruced requested a review from jiayuasu as a code owner February 25, 2025 10:18
Copy link
Member

@jiayuasu jiayuasu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add documentation to this page? https://sedona.apache.org/latest/tutorial/geopandas-shapely/

@Imbruced
Copy link
Member Author

Can you add documentation to this page? https://sedona.apache.org/latest/tutorial/geopandas-shapely/

sure

@Imbruced Imbruced force-pushed the SEDONA-714-add-geopandas-to-spark-arrow-conversion branch from f328661 to 1c96da0 Compare February 25, 2025 22:17
@jiayuasu jiayuasu added this to the sedona-1.7.1 milestone Feb 26, 2025
@jiayuasu jiayuasu merged commit ebd6f67 into master Feb 26, 2025
28 checks passed
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the late review...this is awesome! Thank you!

@jiayuasu jiayuasu deleted the SEDONA-714-add-geopandas-to-spark-arrow-conversion branch February 28, 2025 17:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants